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Abstract 

In unsupervised classification, Hidden Markov Models (HMM) are used to account for a 
neighborhood structure between observations. The emission distributions are often supposed to 
belong to some parametric family. In this paper, a semiparametric modeling where the emission 
distributions are a mixture of parametric distributions is proposed to get a higher flexibility. 
We show that the classical EM algorithm can be adapted to infer the model parameters. For 
the initialisation step, starting from a large number of components, a hierarchical method to 
combine them into the hidden states is proposed. Three likelihood-based criteria to select the 
components to be combined are discussed. To estimate the number of hidden states, BIC-likc 
criteria are derived. A simulation study is carried out both to determine the best combination 
between the merging criteria and the model selection criteria and to evaluate the accuracy of 
classification. The proposed method is also illustrated using a biological dataset from the model 
plant Arabidopsis ihaliana. A R package HMMmix is freely available on the CRAN. 



1 Introduction 



Hidden Markov models (HMM) constitute an efficient technique of unsupervised classification for 



longitudinal data. HMM have been applied in many fields including signal processing (Rabiner 



1989), epidemiology (Sun and Cai, 2009) or genomics (Li et al. 2005 Durbin et al. 1998). In 



such models, the neighbourhood structure is accounted for via a Markov dependency between the 
unobserved labels, whereas the distribution of the observation is ruled by the so-called 'emission' 
distribution. In most cases, the emission distributions associated with the hidden states are given 
a specific form from a parametric class such as Gaussian, Gamma or Poisson. This may lead to a 
poor fit, when the distribution of the data is far from the chosen class. Efforts have been made to 
propose more complex models capable of fitting skewed or heavy-tailed distribution, such as the 



multivariate normal inverse Gaussian distribution proposed in (Chatzis, 2010). However the case 
of multimodal emission distributions has been little studied. 

In the framework of the model-based clustering, where no spatial dependence of the latent 
variable is taken into account, a great interest has been recently paid to the definition of more flexible 



models using mixture as emission distributions (Li, 2005, Baudry et al. 2008). This approach can 



be viewed as semi-parametric as the shape of the distribution of each component of these mixtures 
is hoped to have a weak influence on the estimation of the emission distributions. The main 
difficulty raised by this approach is to combine the components. To achieve this step, classical 
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clustering algorithms are generally used: fc-means approach (Li, 2005) or hierarchical clustering 



(Baudry et al. , 2008). For a general review on the merging problem of Gaussian components in the 
independent case, see (Hennig, 2010). To the best of our knowledge, these approaches have been 
limited to independent mixture models until now. 

In this paper, we propose to extend this semi-parametric modeling to the HMM context. We 
first show that the inference can be fully achieved using the EM algorithm (Section [2]). Section 



2.3 is dedicated to the intialization of the EM algorithm that aims at merging components into 



the hidden states by considering an HMM version of the hierarchical algorithm of (Baudry et al. 



2008). We then consider the choice of the number of hidden states and proposed three BIC-like 



criteria (Section[3]). Based on a simulation study presented in Section[4j the best merging criterion is 
chosen. Eventually, the proposed method is applied to probe classification in ChlP-chip experiment 
for the plant Arabidopsis thaliana (Section [5]). An R package HMMmix is freely available on the 
CRAN. 



2 Model and inference 

This section describes the collection of models considered to fit the data distribution and the E-M 
algorithm used to estimate the parameter vector of each model. 

2.1 Model 

We assume that the observed data X = {X\, ...,X n }, where Xt G MP, are modeled with an HMM. 
The latent variable {St} is a D-state homogeneous Markov chain with transition matrix n = [itdd?] 
and stationary distribution q. The observations {Xt} are independent conditionally to the hidden 
state with emission distribution ipd (d = 1, D): 

(X t \St = d)~^ d . (1) 

We further assume that each of the emission distributions ipd is itself a mixture of parametric 
distributions: 

K d 

ipd = ^2^dk4>(-;idk), (2) 

k=l 

where Xdk is the mixing proportion of the k-th component from cluster d (\/k G {1, ...,Kd}, < 
Xdk < 1 and Xdk = 1) and </>(.; 7) denotes a parametric distribution known up to the parameter 
vector 7. We denote by 9 the vector of free parameters of the model. 

For the purpose of inference, we introduce a second hidden variable {Zt}t which refers to the 
component k within state d, denoted (dk). According to ([I]) and ([2]), the latent variable {Z t } is 
itself a Markov chain with the transition matrix £2 = [^uk),(d'k')\ 1 where 

U(dk),(d'k') = Kdd'^d'k', (3) 

so the transition between (dk) and (d'k 1 ) only depends on the hidden state S at the previous time. 

According to these notations, a model m is defined by D and the -D-uplet (K%, . . . , Kjj) specify- 
ing the number of hidden states and the number of components within each hidden state. Finally in 
the context of HMM with mixture emission distributions, we consider a collection of model defined 
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by 



M = {m := (D,Ki,...,K D )] 

1 < D;\/dK d > 1 with ^K d := K}. 



D 



In this paper, we leave the choice of K to the user, provided that it is large enough to provide a 
good fit. 

2.2 Inference for a given model 

This section is devoted to the parameter vector estimation of a given model m of the collection M . 
The most common strategy for maximum likelihood inference of HMM relies on the Expectation- 



Maximisation (EM) algorithm (Dempster et al. 1977, Cappe et al. 2010). Despite the existence 
of two latent variables S and Z, this algorithm can be applied by using the decomposition of the 
log-likelihood 

logP(X) = E x [logP(X,S,Z)]-E x [logP(S,Z\X)] 

where Ex stands for the conditional expectation, given the observed data X. 
The E-step consists in the calculation of the conditional distribution P(S, Z\X) using the cur- 
rent value of the parameter 6 h . The M-step aims at maximizing the completed log- likelihood 
E x [log P(X, S, Z)], which can be developed as 

E x [log P(X,S,Z)} = E x [logP(S)} +E x [log P(Z\S)} 

+E x [log P(X\S, Z)\ 



where 



D 



n D 

EE 

t=l d,d'=l 



Vtdd' log 7T dd > 



denoting 



E x [log P(S)} = £Vi d log<z(5i = d) + 
d=i 

D K d n 

E x [\ogP{Z\S)} = J2Y,Z~2 Ttddtdklo ^ Xdk 

d=l k=l t=l 

D K d n 

E x [log P(X\S,Z)} =J2Y^J2r td S tdk log<f, d (X t ,' rdk ) 
d=i k=\ t=i 



Ttd = P(S t = d\X) 
Vtdd' = P{S t = d, S t+ i = d'\X) 
5 tdk = P{Z t = dk\S t = d,X) 



E-Step. As P(S, Z\X) = P(S\X)P(Z\S, X), the conditional distribution of the hidden variables 
can be calculated in two steps. First, P(S\X) is the conditional distribution of the hidden Markovian 



state and can be calculated via the forward-backward algorithm (see Rabiner, 1989, for further 
details) which only necessitates the current estimate of the transition matrix Tl h and the current 
estimates of the emission densities at each observation point: tp^(Xt). This algorithm provides 
the two conditional probabilities involved in the completed log-likelihood: r td and r\tdd'- Second, 
P(Z t \S t = d,X) is given by 

\ d k<t>{X t ,j d k) 



Hdk 
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M-Step. The maximization of the completed log-likelihood is straightforward and we get 



n 



^dk 



t=l 

Ylt=l T td$tdk 
l^t=\ T td 



% k = argmax ^ ? td ^ 5 tdk log <f>{x t ) Jdk)- 



ldk t=i k=i 



2.3 Hierarchical initialization of the EM algorithm 

Like any EM algorithm, its behavior strongly depends on the initialization step. The naive idea 
of testing all the possible combinations of the components leads to intractable calculations. We 



choose to follow the strategy proposed by (Baudry et al. 2008), which is based on a hierarchical 



algorithm. At each step of the hierarchical process, the best pair of clusters k, I to be combined is 
selected according to a criterion. To do this, we define three likelihood-based criteria adapted to 
the HMM context: 

V£ =E x [logP(X;G' kul )], 

V*' 5 = E x [logP(X,S;G' kul )], (4) 
V% Z = E x [logP(X,Z;G' kul )} 

where G' klJl is the G — 1 clusters obtained by merging the two clusters k and I from the model with 
G clusters (D < G < K). It is assumed that the hierarchical algorithm is at the G-th step and 
therefore the term 'cluster' refers to either a component or a mixture of components. Two clusters 
k and I are merged if they maximise one of the merging criteria V&j: 

(k, I) = argmax V M . (5) 
fe,/e{i,...,G} 2 

Once the two clusters k and I have been combined into a new cluster A/, we obtain a model with 
G — 1 clusters where the density of the cluster k' is defined by the mixture distributions of clusters 
k and I. Due to the constraints applied on the transition matrix of Zt, the resulting estimates of 
the model parameters do not correspond to the ML estimates. To get closer to a local maximum, a 
few iterations of the EM algorithm are proceeded to increase the likelihood of the reduced model. 
The algorithm corresponding to the hierarchical procedure described above is given in Appendix 

m 



3 Selection Criteria for the number of hidden states 

We recall that the choice of K is left to the user. Given K and D, the EM algorithm is initialized 
by a hierarchical algorithm. In many situations, D is unknown and difficult to choose. To tackle 
this problem, we propose model selection criteria, derived from the classical mixture framework. 

From a Bayesian point of view, the model mfM maximizing the posterior probability P(m\X) 
is to be chosen. By Bayes theorem 
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and supposing a non informative uniform prior distribution P{m) on the models of the collection, 
it leads to P(m\X) oc P(X\m). Thus the chosen model satisfies 

rh = argmaxi- > (X|m), 

meM 

where the integrated likelihood P(X\m) is defined by 



P(X\m) 



P(X\m,9)ir(0\m)d0, 



ir(9\m) being the prior distribution of the vector parameter 9 of the model m. Since this inte- 
grated likelihood is typically difficult to calculate, an asymptotic approximation of 2 ln{P(X\m)} 
is generally used. This approximation is the Bayesian Information Criterion (BIC) defined by 



BIC(m) = log P(X\m,9) 



log(ra). 



(6) 



where v m is the number of free parameters of the model m and P(X\m, 9) is the maximum likelihood 



under this model (Schwarz, 1978). Under certain conditions, BIC consistently estimates the number 
of mixture groups (Keribin 2000). But, as BIC is not devoted to classification, it is expected to 



mostly select the dimension according to the global fit of the model. In the context of model-based 



clustering with a general latent variable U, (Biernacki et al. 2000) have proposed to select the 



number of clusters based on the integrated complete likelihood P(X, U\m, i 



P(X,U\m) 



P(X,U\m,9)Tr(9\m)d9. 



A BIC-like approximation of this integral leads to the so-called ICL criterion: 

ICL(m) = log P(X,U\m, 9) - ^log(ra), 



(7) 



(8) 



where U s tands for posterior mode of U . This definition of ICL relies on a hard partitioning of the 



data and (McLachlan and Peel., 2000) proposed to replace U with the conditional expectation of 



U given the observation and get 

ICL(m) 



E 



x 



log P(X,U\m, 



log P(X\m, 9) -H X (U) 



log(ra) 
^log(n). 



(9) 



Hence, ICL is equivalent to BIC with an additional penalty term, which is the conditional entropy of 



the hidden variable Hx(U) = —Ex 



This entropy is a measure of the uncertainty 



logP(U\X,m,\ 

of the classification. ICL is hence clearly dedicated to a classification purpose, as it penalizes 
models for which the classification is uncertain. One may also note that, in this context, ICL 
simply amounts at adding the BIC penalty to the completed log-likelihood, rather than to the 
log-likelihood itself. ICL has been established in the independent mixture context. Nonetheless, 



(Celeux and Durand, 2008) used ICL in the HMM context and showed that it seems to have the 



same behaviour. 

In our model, the latent variable U is the couple (S, Z), and a direct rewriting of Q leads to 



ICL(m) = log P(X\m, 9) -Hx(S,Z) 



log(n) 



(10) 
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and the conditional entropy of U = (S,Z) can further be decomposed as 



n x (s,z) = nx(s)+E x [nxM z )}, 

which gives raise to two different entropies: %x{S) measures the uncertainty of the classification 
into the hidden states whereas Kx[Hx,s(Z)] measures the classification uncertainty among the 
components, within each hidden states. This latter entropy may not be relevant for our purpose 
as it only refers to a within-class uncertainty. We therefore propose to focus on the integrated 
complete likelihood P(X, S\m, 6) and derive an alternative version of ICL, where only the former 
entropy is used for penalization, defined by: 



ICL s (m) = log P(X\m, 6) -%x{S) 



log(ra). 



(11) 



These three criteria, BIC, ICL and, ICL$, display different behavior in independent mixtures 
and in HMM. In the independent case, the number of free parameters v m only depends on K and 
the observed likelihood log P(X\m, 6) remains the same for a fixed K, whatever D. The BIC 
and the ICL given in Equations Q, (10) are thereby constant whatever the number of clusters. 
Moreover, the ICL$ always increases with the number of clusters so none of these three criteria can 
be used in the independent mixture context. On the contrary, in the case of HMM, the observed 
likelihood log P(X\m, 6) varies with the number of hidden states. Furthermore, because the number 
free parameters does depend on D through the dimension of the transition matrix, the number of 
free parameters of a D-state HMM differs from that of a {D — l)-state HMM, even with same K. 
This allows us the use of these three criteria to select the number of clusters. 

If we go back to the merging criteria of the hierarchical initialization of the EM algorithm 
defined in 12.31 We note that each criterion V is related to one model selection criterion defined 
above. Indeed, the maximisation of V is equivalent to: 



argmax V M 
k,ie{i,...,G} 2 

argmax V fcZ 

k,le{l,...,G} 2 

argmax V fcZ 
k,l£{l,...,G} 2 



argmin [BIC(G) - BIC(G' kul )] 



k.i 



argmin [lCL s (G) - ICL s (G' h 



k.i 



kui) 



argmin [ICL(G) - ICL(G' kul )] . 

k,i 



4 Simulation studies 

In this section, we present two simulation studies to illustrate the performance of our approach. In 



Section 4.1 we aim at determining the best combination between the selection criteria (BIC, ICLg, 
ICL) and the merging criteria (V^, V x,s , X7 X,Z ). In Section 4.2, we compare our method to that 



of (Baudry et al. , 2008) and we focus on the advantage of accounting for Markovian dependency in 



the combination process. 



4.1 Choice of merging and selection criteria 
4.1.1 Design 



The simulation design is the same as that of (Baudry et al. 2008| ), with an additional Markovian 



dependency. We simulated a four-state HMM with Markov chain {St}t- The emission distribu- 
tion is a bidimensional Gaussian for the first two states and is a mixture of two bidimensional 



6 



Gaussians for the other two. Therefore there are six components but only four clusters. In or- 
der to measure the impact of the Markovian dependency on posterior probability estimation, we 
considered four different transition matrices such that Vd 6 {1,...,4}, P(St = d\St-i = d) = a, 
with a G {0.25,0.5,0.75,0.9} and P(S t = d'\S t -i = d) = (1 - a)/3 for d / d'. The degree of 
dependency in the data decreases with a. To control the data shape, we introduce a parameter 
6 in the variance-covariance matrices X^. of the k-th bidimensional Gaussian distributions such as 



The parameter 6 takes its values in {1, 3, 5, 7} where 6 = 1 corresponds to 

0"fcl2 Cfc2 J 

well-separated clusters (Baudry et al. , 2008, case of) and 6 = 7 leads to overlapping clusters. Figure 
[T] displays a simulated dataset for each value of 6. The mean and covariance parameter values are 
given in Appendix [Bl 





Figure 1: Example of simulated data. Top left: 6=1, easiest case, corresponds to the one studied 



by ( jBaudry et al. 2008); top right: 6 = 3; bottom left: 6 = 5; bottom right: 6 = 7. Each group is 



represented by a symbol/color. 

For each of the 16 configurations of (a, 6), we generated C = 100 simulated datasets of size 
n = 800. For the method we proposed, the inference of the If -state HMM has been made with 

spherical Gaussian distributions for the emission, i.e. the variance-covariance matrix is f ^ ^ 2 

The performance of the method is evaluated by both the MSE (Mean Square Error) and the 
classification error rate. The MSE of the conditional probabilities measures the accuracy of a 
method in terms of classification. It evaluates the difference between the conditional probability 
estimation of a criterion V and the theoretical probability 

M^) = ^I^||.f)- T f)||, (12) 

c=l t=l 

The smaller the MSE, the better the performance. Since our aim is to classify the data in a given 
number of clusters, another interesting indicator is the rate of correct classification. This rate allows 
the consistency of the classification to be measured, with respect to the true one. We calculated 
this rate for each simulation configuration where the classification has been obtained with the MAP 
(Maximum A Posteriori) rule. 
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4.1.2 Merging criteria 

The goal is to study the best way to merge the clusters of an HMM. Therefore, we compare the 
three merging criteria with regard to the MSE (see Figure [2]) . 

Dependency contribution. When the data are independent (a = 0.25), we note that the V x 
and X7 X ' Z criteria provide a bad estimation whatever the value of b. The results obtained with V ' 
are satisfying only if the groups are not overlapping (b = 1). These poor results can be explained 
by the definition of the merging criteria that are not suited to the independent case. 



a=0,25 a-0,5 




Figure 2: MSE values for each simulation condition. Each graph corresponds to a transition matrix 
which goes from low dependency level (top left) to high (bottom right), "o": V x , "x": V x,s , "0": 
V x ' z . 

From a general point of view, increasing the value of a yields a better estimation of the con- 
ditional probabilities. When a ^ 0.25, the criterion provides the best results in most cases. 
In high dependency cases (a = 0.75 or 0.9), the results are similar whatever the merging criterion. 
However, for a = 0.5, the V x,z generates estimations far from the true ones even if the groups are 
easily distinguishable (b = 1,3). Further simulation studies (not shown) point out that the V x 
criterion outperformed the others as soon as a > 0.4. 

Effect of the overlap. When a = 0.5, the criteria V"^ and V^' produce similar estimates when 
the groups are well separated (b = 1,3). Increasing the overlap between the groups (b = 5,7) has 
very little influence on the results provided by V^" but is harmful to the V X ' S . When the degree 
of dependency increases (a = 0.75 or 0.9), the criterion still gives the best results whatever the 
value of b. 

Figure [3] describes the variation in the standard deviation of the MSE for each simulation 
condition. Once more the has the best results among the four methods, especially for high 
dependency level. 

Table [T] shows the rate of correct classification with its standard deviation. When a = 0.25, the 
V X ' outperformed the other criteria whatever the value of b. For all other cases (a ^ 0.25), 
merging the components with \7 X ' S allows close or better results than those obtained by the 



S 



Figure 3: Standard deviation of the MSE for each simulation condition. Each graph corresponds 
to a transition matrix which goes from low dependency level (top left) to high (bottom right), "o": 
V x , "x": V x ' s , "0": V x ' z . 



when b either equals 1 or 3. However, when the case is more complicated (6 = 5, 7), the V really 
outperforms. The \7 X ' Z provides the worst results among the three cases proposed. The difference 
between the methods is more flagrant on the MSE values (see Figure [2]). According to the above 
results, we propose the V"^ criterion for merging the clusters. 



Study of the combination of the merging and selection criteria. We now focus on the 
estimation of the number of clusters, which equals 4 for the simulation design given in Section 
4.1.1 We compare the three selection criteria proposed in Section [3] and we study the estimated 



number of clusters for each simulation condition. Table [2] provides the rate of good estimations 
of the number of clusters. This rate is calculated for each dependency level and for each value of 
b. First of all, considering the ICL as a selection criterion does not lead to a good estimation of 
the number of clusters. This can be explained by the fact that ICL involves the latent variable Z 
which is linked to the components. Hence, the ICL tends to overestimate the number of clusters. 
It is more reliable to estimate the number of clusters with a criterion which does not depend on Z 
such as the BIC or the ICLs- As shown in Table [2] the best criterion for estimating the number 
of clusters is ICLs- 



4.1.3 Conclusion 

We proposed three different criteria for combining the clusters and we showed that the V"^ seems 
to outperform the other criteria when the aim is merging the components. In fact, it provides esti- 
mation of the conditional probabilities close to the true ones and these estimations are very robust 
in terms of MSE. Moreover, this is also confirmed by studying the rate of correct classification. 
For the estimation of the number of clusters, ICLs seems to be the most accurate. To conclude, 
we proposed using V"^ as the merging criterion and estimating the number of clusters by ICLs- 
Throughout the remainder of the paper, this strategy is called "HMMmix". 
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Table 1: Rate of correct classification and its standard deviation (mean(sd)). Top: low dependency. 
Bottom: high dependency. The values in bold correspond to the best rate of correct classification 
calculated for the three criteria. 



Parameters V x V X ' S V X ' Z 

m b=l 0.479 (0.0076) 0.958 (0.0087) 0.456 

g b=3 0.457 (0.0065) 0.662 (0.0112) 0.427 

II b=5 0.458 (0.0066) 0.542 (0.0043) 0.422 

15 b=7 0.444 (0.0066) 0.482 (0.0095) 0.409 



0.0055) 
0.005) 
0.005) 
0.0055) 



b=l 0.969 (0.0051) 0.996 (0.0003) 0.871 

b=3 0.913 (0.0054) 0.928 (0.0071) 0.652 

b=5 0.840 (0.0082) 0.694 (0.0106) 0.615 

b=7 0.776 (0.0089) 0.575 (0.0079) 0.564 



0.0122) 
0.0119) 
0.0104) 
0.0085) 



b=l 0.998 (0.0002) 0.998 (0.0002) 0.997 

b=3 0.974 (0.0007) 0.975 (0.0007) 0.962 

b=5 0.940 (0.0013) 0.940 (0.0027) 0.926 

b=7 0.907 (0.0020) 0.859 (0.0090) 0.887 



0.0004) 
0.0021) 
0.0041) 
0.0041) 



b=l 0.995 (0.0032) 0.998 (0.0001) 0.991 

b=3 0.989 (0.0005) 0.989 (0.0004) 0.989 

b=5 0.974 (0.0010) 0.9 75 (0.0009) 0.972 

b=7 0.955 (0.0013) 0.955 (0.0012) 0.952 



0.0052) 
0.0005) 
0.0029) 
0.0021) 



Table 2: Rate of correct estimations of the number of clusters for each simulation condition. The 
merging is done by the criteria. Top: low dependency. Bottom: dependency. The values in 
bold correspond to the best rate of correct classification calculated for the three methods. 



Parameters BIC ICL S ICL 



a b=l 0.88 0.80 0.69 

^ b=3 0.97 0.85 0.38 

II b=5 0.98 0.87 0.38 

b=7 0.98 0.92 0.21 



b=l 0.88 0.96 0.49 

b=3 0.89 0.98 0.16 

b=5 0.95 0.99 0.12 

b=7 0.96 1 0.05 



. a b=l 0.92 1 0.62 

^ b=3 0.96 1 0.17 

II b=5 0.96 1 0.15 

b=7 0.97 1 0.11 



b=l 0.92 0.96 0.50 

b=3 0.96 1 0.29 

b=5 1 1 0.26 

b=7 0.98 1 0.16 
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Table 3: MSE values (mean(sd)) obtained with our method HMMmix and the method of (Baudry 
et al.[ |2008[) 





Parameters 


riiviivimix 


rSaudxy et at. [Z\J\Jo) 




b=l 


0.887 (0.014) 


0.005 (0.0087) 


d 


b=3 


0.795 (0.013) 


0.144 (0.022) 




D — 


n 7no ni i \ 
U. i uy ^U.Ull j 


u.o±y ^u.uzz ) 


IS 


D — f 


U.D (0 J 


n ice /^n ni t\ 
U.ooo ^U.Ul/^ 


10 


b=l 


0.048 (0.007) 


0.006 (0.005) 


o 


b=3 


0.084 (0.008) 


0.169 (0.022) 


II 


b=5 


0.144 (0.014) 


0.385 (0.021) 




b=7 


0.198 (0.014) 


0.369 (0.018) 


10 


b=l 


0.003 (0.0002) 


0.013 (0.007) 


t> 
o 


b=3 


0.016 (0.0007) 


0.173 (0.022) 


II 
ci 


b=5 


0.035 (0.001) 


0.407 (0.020) 




b=7 


0.054 (0.002) 


0.421 (0.017) 


05 


b=l 


0.007 (0.005) 


0.003 (0.0002) 


o 


b=3 


0.009 (0.0007) 


0.229 (0.023) 


II 

IS 


b=5 


0.019 (0.001) 


0.413 (0.022) 




b=7 


0.034 (0.002) 


0.418 (0.017) 



4.2 Markovian dependency contribution 



In this second simulation study, by comparing the method proposed by (Baudry et al. 2008) to 



HMMmix, we are interested in taking into account the Markovian dependency. 



For computing the (Baudry et al. , 2008) approach, we used the package Mclust (Fraley and Raftery 



1999) to run the EM algorithm for the estimation of the mixture parameters. 



In the independent case (a = 0.25), the method of (Baudry et al. 2008) provides better estimation 
of the conditional probabilities than does the HMMmix (see Table [3]). The proposed method tries 
to find non-existent Markovian dependency, making it less efficient. 

Note that, 
same results. 



whatever the value of a, the method of (Baudry et al. 2008) logically provides the 
Regarding the proposed method, we find that the Markovian dependency (when it 
exists) is beneficial for the estimation of conditional probabilities. The interest of accounting for 
dependency in the hierarchical process stands out for the more complicated configuration, i.e. when 
the groups are overlapping (6^1). In this case, our method tends to be more robust. 



4.2.1 Two nested non-Gaussian clusters 



In this section, we focus on an HMM where the emission distribution is neither a Gaussian nor a 
mixture of Gaussian. We simulated datasets according to a binary Markov chain S with transition 
2/3 1/3 
1/3 2/3 

nested (see Figure EJ and correspond to the random variable X such as: 



matrix 



. Denote by C a = [—a; a} 2 the square of side length 2a. The two clusters are 



• (X t \St = 0)~U Vo , with D 

• (X t \S t = 1) ~U Vl , with £>! 



Ci. 

2 



C b+0 .2\C b and b G {0.7,0.55,0.52}. 



The parameter b represents the distance between the two groups: b € {0.02,0.05,0.2}. According 
to this simulation design, we simulated three different datasets (see Figure [4]). Figure [4] displays 
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Truth 



HMMmix 



Baudry 




Figure 4: The distance between the two squares goes from 0.2 (top) to 0.02 (bottom). Left: Truth, 



Middle: HMMmix, Right: Method of (Baudry et al. 2008) 



the classification given by our method and the one given by (Baudry et al. 2008), according to 



the distance between the two squares. Note that the spatial dependence cannot be observed on 
this figure. With our approach, the two nested clusters are well detected with a low number of 
misclassified which logically increases when the distance decreases. If Markovian dependency is not 



taken into account (method of Baudry et al. 2008), the two clusters are identified only when they 



are well separated. In such a complex design, geometrical considerations are not sufficient to detect 
the two clusters; accounting for dependency is therefore required. 



5 Illustration on a real dataset 

We now consider the classification of probes in tiling array data from the model plant Arabidopsis 
thaliana. This experiment has been carried out on a tiling array of about 150 000 probes per 
chromosome. The biological objective of the experiment is to compare the methylation of a specific 
protein (histone H3K9) between a wildtype and the mutant ddml of the plant. It is known that 
the over-methylation or under-methylation of this protein is involved in the regulation process of 
gene expression. As two adjacent probes cover the same genomic region it is required to take into 
account the dependency in the data. Due to computational time, which will be discussed in Section 
[6| we apply our method on a sub-sample of 5000 probes of Chromosome 4. We apply the HMMmix 
method starting with K = 40 components (see Figure [5j Left). The number of clusters given by 
ICLs is 8. Figure [5] (Right) displays the final classification. The cluster on the bottom left (in grey) 
represents the noise, i.e. the unmethylated probes. The four groups on the diagonal correspond 
to the probes which have the same level of methylation for the two conditions. These probes have 
either high (cyan) or low level (black) of methylation. The cluster on the left side of the diagonal 
(red) contains the over-methylated probes in the mutant compared to the wildtype, whereas the 
two clusters on the right side (green) correspond to the under-methylated ones. 
The estimated densities are represented in Figure [6] for each cluster. The histograms are built by 
projecting the data on the X-axis (corresponding to the wildtype) weighted with their posterior 
probabilities. We see that the empirical distributions are not unimodal. Considering a mixture of 
distributions clearly leads to a better fit than single Gaussian. 

The proportions of the over-methylated and under-methylated clusters are 6.5% and 15.5%, 
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Wildtype 



Wildtype 



Figure 5: Left: Representation of the initial HMM with K = 40 components. Right: Classification 
obtained after merging the components. Each color represents a specific cluster. 
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Figure 6: Representation of the fit of densities for each cluster. The colors of the estimated densities 
correspond to the colors of the clusters presented in Figure [5j 



respectively. This result seems to be consistent with the biological expectation (Lippman et al. 



2004). Moreover, in this dataset, there is one well-known transposable element named META1 
which is an under-methylated region. With our method, 75% of the probes of META1 are declared 
under- methylated. 

In conclusion, the final classification is easily interpretable with respect to the biological knowl- 
edge. Furthermore, the flexibility of our model makes it possible to better fit the non-Gaussian 
distribution of real data. 



6 Discussion 



In this article, we have proposed an HMM with mixture of parametric distributions as emission. 
This flexible modeling provides a better estimation of the cluster densities. Our method is based 
on a hierarchical combination of the components which leads to a local optimum of the likelihood. 
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We have defined three criteria and in a simulation study we have highlighted that the V is the 
best criterion for merging components and ICL$ for selecting the number of clusters. A real 
data analysis allows us to illustrate the performance of our method. We have shown that the 
clusters provided by our approach are consistent with the biological knowledge. Although the 
method is described with an unknown hidden states D and is illustrated with mixture of Gaussian 
distributions, we point out that tha same approach can be used when D is known or with other 
parametric distribution family in the mixtures. For the initial number of components K, a brief 
simulation study has shown that for a large enough value of K, the classification still remains the 
same. 

A remaining problem of HMMmix is the computational time, especially when the size of the 
dataset is greater than 10000. This is due to the calculation of the V x criterion which is linked to 
the observed log-likelihood. The computation of this observed log-likelihood requires the forward 
loop of the forward-backward algorithm whose complexity is linear in the number of observations. 
Otherwise, the number of models considered in the hierarchical procedure is ^2d = p d(d — l)/2and 
the computational time dramatically increases with K and is of order 0(nK 3 ) . Consequently, to 
decrease the computational time, the solution is to reduce the space of models to explore. This 
can be done by a pruning criterion based on an approximation of V x leading to a complexity at 
0{nK 2 ) 
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Appendix 

A Algorithm 

We present in this appendix the algorithm we proposed for merging components of an HMM. This 
algorithm has been written with respect to the results we obtained in Section 4.1.2 However, this 
algorithm can easily be written for other criteria. 

1. Fit an HMM with K components. 

2. From G = K,K 

• Select the clusters k and I to be combined as: 



(k, I) = argmax V^, 

MG{1,...,K} 2 

• Update the parameters with a few steps of the EM algorithm to get closer to a local 
optimum. 

3. Selection of the number of groups D: 

D = argmax ICLs(k) 
ke{K,...,i} 

= argmax log P(X, S\k, 9 k ) - ^log(n), 

k&{K,...,l} 1 
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B Mean and variance of the Gaussian distributions for the simu- 



lation study (Section 4.1) 
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